Stemming Indonesian
Abstract
Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarisation, and text classification. For example, English stemming reduces the words “computer”, “computing”, “computation”, and “computability” to their common morphological root, “comput-”. In text search, this permits a search for “computers” to find documents containing all words with the stem “comput-”. In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. In this paper, we investigate the performance of five Indonesian stemming algorithms through a user study. Our results show that, with the availability of a reasonable dictionary, the unpublished algorithm of Nazief and Adriani stems around 93% of word occurrences to the correct root word. With the improvements we propose, this almost reaches 95%. We conclude that stemming for Indonesian should be performed using our modified Nazief and Adriani approach.
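To make the idea of dictionary-backed affix stripping concrete, here is a minimal Python sketch. It is a toy illustration only and is not the Nazief and Adriani algorithm or any of the five stemmers evaluated in the paper; the root-word dictionary, the affix lists, and the function name `stem` are all hypothetical samples chosen for this example. It handles only simple prefixes and suffixes, not infixes or the spelling changes that real Indonesian morphology requires.

```python
# Toy illustration only: NOT the Nazief and Adriani algorithm.
# ROOT_WORDS, PREFIXES and SUFFIXES are small hypothetical samples.
ROOT_WORDS = {"makan", "main", "baca"}
PREFIXES = ("mem", "me", "ber", "di", "ter")
SUFFIXES = ("kan", "an", "i", "nya", "lah")


def stem(word: str) -> str:
    """Return a dictionary root for `word`, or `word` unchanged if none is found."""
    word = word.lower()
    if word in ROOT_WORDS:
        return word

    # An empty string means "strip nothing" on that side of the word.
    prefix_options = [""] + [p for p in PREFIXES if word.startswith(p)]
    suffix_options = [""] + [s for s in SUFFIXES if word.endswith(s)]

    # Try every prefix/suffix combination and accept the first candidate
    # that is confirmed by the dictionary.
    for prefix in prefix_options:
        for suffix in suffix_options:
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in ROOT_WORDS:
                return candidate

    # No dictionary match: leave the word as it is rather than guess.
    return word


for w in ("makanan", "bermain", "dimakan", "membaca", "komputer"):
    print(w, "->", stem(w))
```

The dictionary check is what keeps the sketch from over-stemming: a stripped form is accepted only when it is a known root, which mirrors the abstract's point that results depend on the availability of a reasonable dictionary.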
Related articles
Indexing the Indonesian Web: Language Identification and Miscellaneous Issues
Information retrieval tools and search engines have mainly been leveraging research results and technologies developed for the English language. In this paper we report the issues and obstacles we met in the process of designing and developing a search engine for the Indonesian language, as well as our progress and results. The results include original contributions such as a grammar for stemmi...
Intellectual Pilgrimages and Local Norms in Fashioning Indonesian Islam
Muslims living in the Indonesian archipelago have long placed considerable importance on their travels to and communications with what they saw as intellectual centers for the study of Islam. I trace some of the effects of these “intellectual pilgrimages” to Mecca, Cairo, and elsewhere on Indonesian deliberations about Islam, particularly concerning Islamic law. I argue that these references to...
Modified Grapheme Encoding and Phonemic Rule to Improve PNNR-Based Indonesian G2P
A grapheme-to-phoneme conversion (G2P) is very important in both speech recognition and synthesis. The existing Indonesian G2P based on pseudo nearest neighbour rule (PNNR) has two drawbacks: the grapheme encoding does not adapt all Indonesian phonemic rules and the PNNR should select a best phoneme from all possible conversions even though they can be filtered by some phonemic rules. In this p...
Lemmatization Technique in Bahasa: Indonesian Language
Much research and many inventions have been made in the fields of linguistics and technology. Even so, the integration between linguistics and technology is not always reliable for every language. Every language is unique in its linguistic nature and rules. In this paper, a lemmatization technique in Bahasa (Indonesian language) is presented. It has achieved good precision by using The Indonesian Dict...
Automatic Learning of Stemming Rules for the Indonesian Language
We present a method for the automatic learning of stemming rules for the Indonesian language. The learning process uses an unlabelled corpus. In the first phase the candidate (word, stem) pairs are automatically extracted from a set of online documents. This phase uses a dictionary but is nevertheless not trivial because of morphing. In the second phase the rules are induced from the thus obtai...